1 Introduction

## Loading the dataset
require("CASdatasets")
data("freMTPLfreq")

freMTPLfreq = subset(freMTPLfreq, Exposure <= 1 & Exposure >= 0 & CarAge <= 
    25)

set.seed(85)
require("caret")
folds = createDataPartition(freMTPLfreq$ClaimNb, 0.5)
dataset = freMTPLfreq[folds[[1]], ]

A good idea is to check whether the dataset has been loaded correctly. To do this, the following tools can be used:

  • head allows to visualize the first 6 lines of the dataset.
head(dataset)
##   PolicyID ClaimNb Exposure Power CarAge DriverAge
## 1        1       0     0.09     g      0        46
## 2        2       0     0.84     g      0        46
## 4        4       0     0.45     f      2        38
## 5        5       0     0.15     g      0        41
## 7        7       0     0.81     d      1        27
## 9        9       0     0.76     d      9        23
##                                Brand     Gas Region Density
## 1 Japanese (except Nissan) or Korean  Diesel    R72      76
## 2 Japanese (except Nissan) or Korean  Diesel    R72      76
## 4 Japanese (except Nissan) or Korean Regular    R31    3003
## 5 Japanese (except Nissan) or Korean  Diesel    R52      60
## 7 Japanese (except Nissan) or Korean Regular    R72     695
## 9                               Fiat Regular    R31    7887
  • str allows to see the format of the different variables. We will typically distinguish numerical variables (real numbers or integers) and factors (categorical data).
str(dataset)
## 'data.frame':    205432 obs. of  10 variables:
##  $ PolicyID : int  1 2 4 5 7 9 10 14 17 18 ...
##  $ ClaimNb  : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ Exposure : num  0.09 0.84 0.45 0.15 0.81 0.76 0.34 0.19 0.8 0.07 ...
##  $ Power    : Factor w/ 12 levels "d","e","f","g",..: 4 4 3 4 1 1 6 2 2 2 ...
##  $ CarAge   : int  0 0 2 0 1 9 0 0 0 0 ...
##  $ DriverAge: int  46 46 38 41 27 23 44 33 69 69 ...
##  $ Brand    : Factor w/ 7 levels "Fiat","Japanese (except Nissan) or Korean",..: 2 2 2 2 2 1 2 2 2 2 ...
##  $ Gas      : Factor w/ 2 levels "Diesel","Regular": 1 1 2 1 2 2 2 2 2 2 ...
##  $ Region   : Factor w/ 10 levels "R11","R23","R24",..: 9 9 5 6 9 5 1 1 1 1 ...
##  $ Density  : int  76 76 3003 60 695 7887 27000 1746 1376 1376 ...
  • summary allows to compute for each variable some summary statistics.
summary(dataset)
##     PolicyID         ClaimNb           Exposure            Power      
##  Min.   :     1   Min.   :0.00000   Min.   :0.002732   f      :47709  
##  1st Qu.:102981   1st Qu.:0.00000   1st Qu.:0.200000   g      :45390  
##  Median :206768   Median :0.00000   Median :0.530000   e      :38329  
##  Mean   :206581   Mean   :0.03914   Mean   :0.559871   d      :33720  
##  3rd Qu.:309942   3rd Qu.:0.00000   3rd Qu.:1.000000   h      :13346  
##  Max.   :413168   Max.   :4.00000   Max.   :1.000000   j      : 8962  
##                                                        (Other):17976  
##      CarAge         DriverAge    
##  Min.   : 0.000   Min.   :18.00  
##  1st Qu.: 3.000   1st Qu.:34.00  
##  Median : 7.000   Median :44.00  
##  Mean   : 7.417   Mean   :45.29  
##  3rd Qu.:12.000   3rd Qu.:54.00  
##  Max.   :25.000   Max.   :99.00  
##                                  
##                                 Brand             Gas        
##  Fiat                              :  8341   Diesel :102771  
##  Japanese (except Nissan) or Korean: 39413   Regular:102661  
##  Mercedes, Chrysler or BMW         :  9651                   
##  Opel, General Motors or Ford      : 18638                   
##  other                             :  4947                   
##  Renault, Nissan or Citroen        :108267                   
##  Volkswagen, Audi, Skoda or Seat   : 16175                   
##      Region         Density     
##  R24    :79767   Min.   :    2  
##  R11    :34778   1st Qu.:   67  
##  R53    :21032   Median :  287  
##  R52    :19321   Mean   : 1982  
##  R72    :15506   3rd Qu.: 1408  
##  R31    :13576   Max.   :27000  
##  (Other):21452

If one needs some help on a function, typing a question mark and the name of the function in the console opens the help file of the function. For instance,

?head

2 Descriptive Analysis of the portfolio

We will now have a descriptive analysis of the portfolio. The different variables available are PolicyID, ClaimNb, Exposure, Power, CarAge, DriverAge, Brand, Gas, Region, Density.

2.1 PolicyID

The variable PolicyID related to a unique identifier of the policy. We can check that every policy appears only once in the dataset

length(unique(dataset$PolicyID)) == nrow(dataset)
## [1] TRUE

2.2 Exposure in month

The Exposure reveals the fraction of the year during which the policyholder is in the portfolio. We can compute the total exposure by summing the policyholders’ exposures. Here we find 115015.5 years.

We can show the number of months of exposure on a table.

table(cut(dataset$Exposure, breaks = seq(from = 0, to = 1, by = 1/12), labels = 1:12))
## 
##     1     2     3     4     5     6     7     8     9    10    11    12 
## 31393 14729 16610 12040  9793 14680  9447  7248 10714  6838  6066 65874

Using the function prop.table, it is possible to represent this information in relative terms show the number of months of exposure on a table.

round(prop.table(table(cut(dataset$Exposure, breaks = seq(from = 0, to = 1, 
    by = 1/12), labels = 1:12))), 4)
## 
##      1      2      3      4      5      6      7      8      9     10 
## 0.1528 0.0717 0.0809 0.0586 0.0477 0.0715 0.0460 0.0353 0.0522 0.0333 
##     11     12 
## 0.0295 0.3207

Alternatively, we can use a barplot !

Exposure.summary = cut(dataset$Exposure, breaks = seq(from = 0, to = 1, by = 1/12))
levels(Exposure.summary) = 1:12
ggplot() + geom_bar(aes(x = Exposure.summary)) + xlab("Number of months") + 
    ggtitle("Exposure in months")

2.3 Number of claim : ClaimNb

ggplot(dataset, aes(x = ClaimNb)) + geom_bar() + geom_text(stat = "count", aes(label = ..count..), 
    vjust = -1) + ylim(c(0, 210000)) + ylab("") + xlab("Number of Claims") + 
    ggtitle("Proportion of policies by number of claims")

We can compute the average claim frequency in this portfolio, taking into account the different exposures.

sum(dataset$ClaimNb)/sum(dataset$Exposure)

Here, we obtain 0.0699.

Let us now look at the other variables.

2.4 Power

The variable Power is a categorized variable, related to the power of the car. The levels of the variable are ordered categorically. We can see the different levels of a factor by using the function level in R:

levels(dataset$Power)
##  [1] "d" "e" "f" "g" "h" "i" "j" "k" "l" "m" "n" "o"

We can see the number of observations in each level of the variable, by using the function table.

table(dataset$Power)
## 
##     d     e     f     g     h     i     j     k     l     m     n     o 
## 33720 38329 47709 45390 13346  8793  8962  4659  2255   905   624   740

Remember however, that in insurance, exposures may differ from one policyholder to another. Hence, the table above, does NOT measure the exposure in each level of the variable Power. We can use the function ddply to give us the exposure in each level of the variable.

require(plyr)
Power.summary = ddply(dataset, .(Power), summarize, totalExposure = sum(Exposure), 
    Number.Observations = length(Exposure))

We can show this on a plot as well:

ggplot(Power.summary, aes(x = Power, y = totalExposure, fill = Power)) + geom_bar(stat = "identity") + 
    ylab("Exposure in years") + geom_text(stat = "identity", aes(label = round(totalExposure, 
    0), color = Power), vjust = -0.5) + guides(fill = FALSE, color = FALSE)

Let us now look at the observed claim frequency in each level

Power.summary = ddply(dataset, .(Power), summarize, totalExposure = sum(Exposure), 
    Number.Observations = length(Exposure), Number.Claims = sum(ClaimNb), Obs.Claim.Frequency = sum(ClaimNb)/sum(Exposure))
Power.summary
Power totalExposure Number.Observations Number.Claims Obs.Claim.Frequency
d 18711.53 33720 1148 0.06135
e 22157.35 38329 1631 0.07361
f 27823.54 47709 1998 0.07181
g 25678.47 45390 1672 0.06511
h 6956.80 13346 505 0.07259
i 4680.11 8793 369 0.07884
j 4587.50 8962 357 0.07782
k 2247.44 4659 188 0.08365
l 1046.09 2255 78 0.07456
m 475.74 905 37 0.07777
n 317.64 624 31 0.09760
o 333.32 740 26 0.07800

We can compute the ratio to the portfolio claim frequency.

portfolio.cf = sum(dataset$ClaimNb)/sum(dataset$Exposure)

ggplot(Power.summary) + geom_bar(stat = "identity", aes(x = Power, y = Obs.Claim.Frequency, 
    fill = Power)) + geom_line(aes(x = as.numeric(Power), y = portfolio.cf), 
    color = "red") + guides(fill = FALSE)

2.5 CarAge

The vehicle age, in years. This is the first continuous variable that we encounter (although it only takes discrete values).

ggplot(dataset, aes(x = CarAge)) + geom_bar() + xlab("Age of the Car")

Again, here, the exposures are not considered on the histogram. We can use ddply to correct this.

CarAge.summary = ddply(dataset, .(CarAge), summarize, totalExposure = sum(Exposure), 
    Number.Observations = length(Exposure))
CarAge.summary
CarAge totalExposure Number.Observations
0 4346.42 14947
1 9016.91 18863
2 8621.68 16246
3 7902.55 14337
4 7438.18 12851
5 7206.40 11889
6 6920.09 11245
7 6538.41 10516
8 6636.01 10624
9 6319.48 10268
10 6931.74 12168
11 6051.96 9727
12 5819.12 9544
13 5622.09 9169
14 4938.09 8244
15 4349.96 7674
16 3099.81 5156
17 2375.68 3950
18 1713.78 2822
19 1175.23 1931
20 732.41 1231
21 470.92 781
22 316.97 505
23 207.92 328
24 145.56 219
25 118.17 197

Then, we can plot the data onto a barplot, as before.

ggplot(CarAge.summary, aes(x = CarAge, y = totalExposure)) + geom_bar(stat = "identity") + 
    ylab("Exposure in years")

We can see a large difference, specially for new cars, which makes sense ! Indeed, let us look at the Exposure for new vehicles, using a boxplot for instance.

ggplot(dataset[dataset$CarAge == 0, ], aes(x = "Exposure", y = Exposure)) + 
    geom_boxplot() + ggtitle("Exposure of new cars")

Let us now also compute the claim frequency by age of car,

CarAge.summary = ddply(dataset, .(CarAge), summarize, totalExposure = sum(Exposure), 
    Number.Observations = length(Exposure), Number.Claims = sum(ClaimNb), Obs.Claim.Freq = sum(ClaimNb)/sum(Exposure))
CarAge.summary
CarAge totalExposure Number.Observations Number.Claims Obs.Claim.Freq
0 4346.42 14947 285 0.06557
1 9016.91 18863 653 0.07242
2 8621.68 16246 591 0.06855
3 7902.55 14337 551 0.06972
4 7438.18 12851 528 0.07099
5 7206.40 11889 501 0.06952
6 6920.09 11245 509 0.07355
7 6538.41 10516 502 0.07678
8 6636.01 10624 479 0.07218
9 6319.48 10268 487 0.07706
10 6931.74 12168 527 0.07603
11 6051.96 9727 441 0.07287
12 5819.12 9544 449 0.07716
13 5622.09 9169 369 0.06563
14 4938.09 8244 330 0.06683
15 4349.96 7674 273 0.06276
16 3099.81 5156 197 0.06355
17 2375.68 3950 140 0.05893
18 1713.78 2822 88 0.05135
19 1175.23 1931 48 0.04084
20 732.41 1231 39 0.05325
21 470.92 781 20 0.04247
22 316.97 505 11 0.03470
23 207.92 328 6 0.02886
24 145.56 219 9 0.06183
25 118.17 197 7 0.05924

and plot it!

ggplot(CarAge.summary, aes(x = CarAge, y = Obs.Claim.Freq)) + geom_point() + 
    ylab("Observed Claim Frequency") + xlab("Age of the Car") + ylim(c(0, 0.08))

2.6 DriverAge

Similarly to the Age of the Car, we can visualize the Age of the Drivers.

DriverAge.summary = ddply(dataset, .(DriverAge), summarize, totalExposure = sum(Exposure), 
    Number.Observations = length(Exposure), Number.Claims = sum(ClaimNb), Obs.Claim.Freq = sum(ClaimNb)/sum(Exposure))
head(DriverAge.summary, 9)
DriverAge totalExposure Number.Observations Number.Claims Obs.Claim.Freq
18 78.37 257 23 0.29347
19 314.15 811 90 0.28648
20 492.48 1213 112 0.22742
21 637.28 1528 105 0.16476
22 794.64 1822 122 0.15353
23 923.73 2074 124 0.13424
24 1073.46 2434 111 0.10340
25 1234.65 2765 114 0.09233
26 1476.38 3240 157 0.10634

We can show the Exposures by Age of the Driver

ggplot(DriverAge.summary, aes(x = DriverAge, y = totalExposure)) + geom_bar(stat = "identity", 
    width = 0.8) + ylab("Exposure in years") + xlab("Age of the Driver")

and the observed claim frequency by Age.

ggplot(DriverAge.summary, aes(x = DriverAge, y = Obs.Claim.Freq)) + geom_point() + 
    ylab("Observed Claim Frequency") + xlab("Age of the Driver")

2.7 Brand

The variable Brand is a categorized variable, related to the brand of the car. We can see the different levels of a factor by using the function level in R:

levels(dataset$Brand)
## [1] "Fiat"                              
## [2] "Japanese (except Nissan) or Korean"
## [3] "Mercedes, Chrysler or BMW"         
## [4] "Opel, General Motors or Ford"      
## [5] "other"                             
## [6] "Renault, Nissan or Citroen"        
## [7] "Volkswagen, Audi, Skoda or Seat"
Brand.summary = ddply(dataset, .(Brand), summarize, totalExposure = sum(Exposure), 
    Number.Observations = length(Exposure), Number.Claims = sum(ClaimNb), Obs.Claim.Freq = sum(ClaimNb)/sum(Exposure))
Brand.summary
Brand totalExposure Number.Observations Number.Claims Obs.Claim.Freq
Fiat 4728.26 8341 357 0.07550
Japanese (except Nissan) or Korean 15499.62 39413 993 0.06407
Mercedes, Chrysler or BMW 5254.51 9651 416 0.07917
Opel, General Motors or Ford 10812.52 18638 847 0.07834
other 2905.09 4947 232 0.07986
Renault, Nissan or Citroen 66756.93 108267 4446 0.06660
Volkswagen, Audi, Skoda or Seat 9058.60 16175 749 0.08268
require(ggplot2)
ggplot(Brand.summary, aes(x = reorder(Brand, totalExposure), y = totalExposure, 
    fill = Brand)) + geom_bar(stat = "identity") + coord_flip() + guides(fill = FALSE) + 
    xlab("") + ylab("Exposure in years")

Let us now look at the claim frequency by Brand of the car.

ggplot(Brand.summary, aes(x = reorder(Brand, Obs.Claim.Freq), y = Obs.Claim.Freq, 
    fill = Brand)) + geom_bar(stat = "identity") + coord_flip() + guides(fill = FALSE) + 
    ggtitle("Observed Claim Frequencies by Brand of the car") + xlab("") + ylab("Observed Claim Frequency")

2.8 Gas

The variable Gas is a categorized variable, related to the fuel of the car. We can see the different levels of a factor by using the function level in R:

levels(dataset$Gas)
## [1] "Diesel"  "Regular"
Gas.summary = ddply(dataset, .(Gas), summarize, totalExposure = sum(Exposure), 
    Number.Observations = length(Exposure), Number.Claims = sum(ClaimNb), Obs.Claim.Freq = sum(ClaimNb)/sum(Exposure))
ggplot(Gas.summary, aes(x = Gas, y = totalExposure, fill = Gas)) + geom_bar(stat = "identity") + 
    guides(fill = FALSE)

There seems to be a similar amount of Diesel and Regular gas vehicles in the portfolio. It is generally expected that Diesel have a higher claim frequency. Does this also hold on our dataset ?

ggplot(Gas.summary, aes(x = Gas, y = Obs.Claim.Freq, fill = Gas)) + geom_bar(stat = "identity") + 
    guides(fill = FALSE)

2.9 Region

The variable Region is a categorized variable, related to the region of the place of residence. We can see the different levels of a factor by using the function level in R:

levels(dataset$Region)
##  [1] "R11" "R23" "R24" "R25" "R31" "R52" "R53" "R54" "R72" "R74"

What are the Exposures in each region ? What are the observed claim frequencies ?

Region.summary = ddply(dataset, .(Region), summarize, totalExposure = sum(Exposure), 
    Number.Observations = length(Exposure), Number.Claims = sum(ClaimNb), Obs.Claim.Freq = sum(ClaimNb)/sum(Exposure))
Region.summary
Region totalExposure Number.Observations Number.Claims Obs.Claim.Freq
R11 15024.48 34778 1266 0.08426
R23 1531.38 4275 94 0.06138
R24 50927.07 79767 3289 0.06458
R25 3274.99 5398 227 0.06931
R31 5676.75 13576 483 0.08508
R52 10871.82 19321 760 0.06991
R53 13889.48 21032 928 0.06681
R54 5573.92 9455 386 0.06925
R72 7033.85 15506 516 0.07336
R74 1211.77 2324 91 0.07510

Using the function twoord.plot we can easily show both the Exposures and the observed claim frequencies on the same plot.

twoord.plot(1:10, Region.summary$totalExposure, 1:10, Region.summary$Obs.Claim.Freq, 
    xlab = "Region", rylim = c(0, 0.1), type = c("bar", "p"), xticklab = Region.summary$Region, 
    ylab = "Exposure", rylab = "Observed Claim Frequency")

We can plot a map with the observed claim frequencies

library(maptools)
library(ggplot2)
area <- readShapePoly("shapefiles/FRA_adm2.shp")  # From http://www.diva-gis.org/gData

Region.summary$id = sapply(Region.summary$Region, substr, 2, 3)
area.points = fortify(area, region = "ID_2")

area.points = merge(area.points, Region.summary[, c("id", "totalExposure", "Obs.Claim.Freq")], 
    by.x = "id", by.y = "id", all.x = TRUE)
area.points = area.points[order(area.points$order), ]



ggplot(area.points, aes(long, lat, group = group)) + ggtitle("Observed Claim Frequencies") + 
    geom_polygon(aes(fill = area.points$Obs.Claim.Freq)) + scale_fill_gradient(low = "yellow", 
    high = "red", name = "Obs. Claim Freq.", limits = c(0.061, 0.085)) + xlab("Longitude") + 
    ylab("Latitude")

and the exposures (on a log-scale)…

ggplot(area.points, aes(long, lat, group = group)) + ggtitle("log Exposures in years") + 
    geom_polygon(aes(fill = log(area.points$totalExposure))) + scale_fill_gradient(low = "blue", 
    high = "red", name = "log Exposure") + xlab("Longitude") + ylab("Latitude")

2.10 Density

The Density represents here the density of the population at the place of residence. Let us take a look at the densities in the dataset.

summary(dataset$Density)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##       2      67     287    1982    1408   27000
ggplot(dataset, aes(Density)) + geom_histogram(bins = 200)

Here, contrary to the age of the driver, or the age of the car, the density has lots of different values

length(unique(dataset$Density))

We can compute this by using the command above, and we get 1267.

Density.summary = ddply(dataset, .(Density), summarize, totalExposure = sum(Exposure), 
    Number.Observations = length(Exposure), Number.Claims = sum(ClaimNb), Obs.Claim.Freq = sum(ClaimNb)/sum(Exposure))
head(Density.summary)
Density totalExposure Number.Observations Number.Claims Obs.Claim.Freq
2 9.07 13 1 0.11025
3 58.94 81 2 0.03393
4 18.22 29 1 0.05490
5 67.99 122 7 0.10295
6 131.76 207 8 0.06072
7 182.77 288 7 0.03830

We can plot the observed claim frequencies…

ggplot(Density.summary, aes(x = Density, y = Obs.Claim.Freq)) + geom_point()

… but realize it is impossible to see a trend. One way out is to categorize the variable. We will see later (GAM) that it is possible to estimate a smooth function, which avoid the arbitrary categorization.

We can categorize the variable using the function cut.

dataset$DensityCAT = cut(dataset$Density, breaks = quantile(dataset$Density, 
    probs = seq(from = 0, to = 1, by = 0.1)), include.lowest = TRUE)
table(dataset$DensityCAT)
## 
##             [2,28]            (28,51]            (51,91] 
##              20813              20571              20805 
##           (91,158]          (158,287]          (287,554] 
##              19986              20584              20508 
##     (554,1.16e+03] (1.16e+03,2.4e+03] (2.4e+03,4.35e+03] 
##              20571              20541              21290 
## (4.35e+03,2.7e+04] 
##              19763
levels(dataset$DensityCAT) <- LETTERS[1:10]

Then, we can apply the same strategy as above.

Density.summary = ddply(dataset, .(DensityCAT), summarize, totalExposure = sum(Exposure), 
    Number.Observations = length(Exposure), Number.Claims = sum(ClaimNb), Obs.Claim.Freq = sum(ClaimNb)/sum(Exposure))
Density.summary
DensityCAT totalExposure Number.Observations Number.Claims Obs.Claim.Freq
A 13499.50 20813 678 0.05022
B 12959.15 20571 738 0.05695
C 12695.77 20805 769 0.06057
D 12001.69 19986 754 0.06282
E 12166.47 20584 798 0.06559
F 11918.67 20508 845 0.07090
G 11065.29 20571 904 0.08170
H 10670.17 20541 897 0.08407
I 9707.02 21290 922 0.09498
J 8331.80 19763 735 0.08822

Using the function twoord.plot we can easily show both the Exposures and the observed claim frequencies on the same plot.

twoord.plot(1:10, Density.summary$totalExposure, 1:10, Density.summary$Obs.Claim.Freq, 
    xlab = "Density (categorized)", lylim = c(0, 15000), rylim = c(0, 0.15), 
    type = c("bar", "p"), xticklab = Density.summary$Density, ylab = "Exposure", 
    rylab = "Observed Claim Frequency", lytickpos = seq(0, 15000, 5000), rytickpos = seq(0, 
        0.15, 0.03))